**Housekeeping
clear all
cap log close
log using ${rep_root}/logs/read_complaints_cms.log, text replace
set more off

**INCIDENTS
**Read the raw incident data
import delimited using ${rep_root}/data/FOIA/FOIA_complaints-recent_05212021/cms_case_info.csv, varnames(1)

**Extract matching cr_id from log number
rename *log_no log_no
gen log_yr = substr(log_no, 1, 4)
gen log_id = substr(log_no, 6, 7)
gen cr_id = log_yr + log_id
destring cr_id, force replace
drop if missing(cr_id)

**Record start date, complaint date, and closed date
gen inc_start_dt = date(incident_date, "DMY", 2025)
gen inc_complaint_dt = date(complaint_date, "MDY", 2025)
gen inc_closed_dt = date(closed_date, "MDY", 2025)

**Mark that complaint brought by civilian
gen inc_civilian = complainant_type=="Civilian"

**Mark if complaint is related to a police shooting
gen inc_shooting = police_shooting == "Yes"

**Record police district and beat
destring beat_of_incident, force gen(inc_beat)
gen inc_district = floor(inc_beat/100)

**Unify non-chicago "districts" under 31 and set to missing if outside expected values
replace inc_district = 31 if inc_district==41
replace inc_district = . if !(inrange(inc_district, 1, 25) | inc_district==31)

**Run a simple program that categorizes the allegation code
do ${rep_root}/code/cat_prg_cms.do

**Mark if complaint category contains the word "domestic"
gen inc_any_domestic_i = strpos(allegation_category, "Domestic")>0

**Standardize how addresses are written in the data
gen inc_address = street_no + " " + upper(street_direction) + " " + upper(street_name)
gen inc_city = upper(city)
rename state inc_state
replace inc_state = "Illinois" if inc_state=="IL"
replace zip_cd = substr(zip_cd, 1, 5)
destring zip_cd, force gen(inc_zip)

**Keep only relevant variables and drop duplicate records with respect to these variables
keep cr_id inc_* investigating_agency

duplicates drop

**Some CR IDs appear more than once in the incident data because there is more than one incident
**associated with the complaint. We will organize these multiple incident reports wide to ensure the
**final dataset is unique by CR ID

**First sort by all the variables in the dataset, prioritizing cases that involve a shooting, then earlier cases
gsort cr_id inc_shooting inc_start_dt inc_complaint_dt inc_closed_dt inc_civilian inc_beat inc_address inc_city inc_state inc_zip
by cr_id: gen idx = _n
by cr_id: gen inc_tot = _N

**Record if any of the underlying incidents are domestic
by cr_id: egen inc_any_domestic = max(inc_any_domestic_i)
drop inc_any_domestic_i

**Keep only the first 4 incidents
drop if idx>4

**Reshape the data wide within a complaint
rename inc_* inc_*_
rename inc_tot_ inc_tot
rename inc_any_domestic_ inc_any_domestic
reshape wide inc_*_, i(cr_id) j(idx)

**Save incident data
save ${rep_root}/data/inc_cms, replace
clear

**VICTIMS
**Read raw victim data
import delimited using ${rep_root}/data/FOIA/FOIA_complaints-recent_05212021/cms_subject.csv, varnames(1)

**Extract matching cr_id from log numbe
rename *log_no log_no
gen log_yr = substr(log_no, 1, 4)
gen log_id = substr(log_no, 6, 7)
gen cr_id = log_yr + log_id
destring cr_id, force replace
drop if missing(cr_id)

**Record indicator that victim is male
gen vic_male_ = gender=="Male" | gender=="M"
replace vic_male_ = . if missing(gender) | gender=="NULL"

**Record simple race category (black, white, other)
gen vic_race_ = .
replace vic_race_ = 1 if inlist(race, "WHI", "WWH", "White")
replace vic_race_ = 2 if race=="Black or African American"
replace vic_race_ = 3 if !missing(race) & race!="NULL" & missing(vic_race_)

**Record victim's birth year and require it be in a reasonable range
destring birth_year, force gen(vic_byr_)
replace vic_byr_ = . if !inrange(vic_byr_, 1900, 2020)

**Categorize injury codes
gen vic_inj_ = .
replace vic_inj_ = 0 if inlist(injury_condition, "NULL", "No Injury")
replace vic_inj_ = 1 if injury_condition=="Non Seriously Bodily Injury"
replace vic_inj_ = 2 if injury_condition=="Serious Bodily Injury"
replace vic_inj_ = 3 if injury_condition=="Fatal"

**Check for any female victims across all victims in the complaint
sort cr_id
gen vic_female_i = vic_male_==0 & !missing(vic_male_)
by cr_id: egen vic_female_any = max(vic_female_i)
drop vic_female_i

**Check for any white victims across all victims in the complaint
gen vic_white_i = vic_race_==1 & !missing(vic_race_)
by cr_id: egen vic_white_any = max(vic_white_i)
drop vic_white_i

**Record age of the oldest and youngest victim in the complaint
by cr_id: egen vic_byr_oldest = max(vic_byr_)
by cr_id: egen vic_byr_youngest = min(vic_byr_)

**Restrict to just cr_id created variables
keep cr_id vic_* 
duplicates drop 

**Complaints can have multiple victims. We will sort wide by victim within each 
**complaint

**Sort by victim information, prioritizing the victim who is most injured
gsort cr_id -vic_inj_ vic_byr_ vic_male_ vic_race_
by cr_id: gen idx = _n
by cr_id: gen vic_tot = _N

**Keeping only the first 4 victims
drop if idx>4

**Reshape wide within victim
reshape wide vic_byr_ vic_male_ vic_race_ vic_inj_, i(cr_id) j(idx)

**Label race and injury codes
label define race_lab 1 "White" 2 "Black" 3 "Other" 
label define inj_lab 0 "Not Injured" 1 "Minor Injury" 2 "Severe Injury" 3 "Deceased"
forvalues i = 1/4{
	label values vic_race_`i' race_lab
	label values vic_inj_`i' inj_lab
}

**Save victim data
save ${rep_root}/data/vic_cms, replace
clear

**INVESTIAGORS
**Read the raw investigator data
import delimited using ${rep_root}/data/FOIA/FOIA_complaints-recent_05212021/cms_investigator.csv, varnames(1)

**Extract matching cr_id from log numbe
rename *log_no log_no
gen log_yr = substr(log_no, 1, 4)
gen log_id = substr(log_no, 6, 7)
gen cr_id = log_yr + log_id
destring cr_id, force replace
drop if missing(cr_id)

**Give each investigator a unique ID, taking care to ensure they are harmonized
**with the IDs already used in the pre-CMS data.
preserve
sort first_name last_name
by first_name last_name: keep if _n==1
gen inv_id = _n

keep first_name last_name inv_id

**Temporarily rename the new investigator IDs
rename inv_id inv_cms

**Add on original investigator IDS
merge 1:1 first_name last_name using ${rep_root}/data/inv_id
drop if _merge == 2

**Use original investigaor ID if there is a match. Otherwise, use
**5000+the new ID, to ensure no clashes
replace inv_cms = 5000+inv_cms if _merge == 1
replace inv_id = inv_cms if _merge == 1
drop _merge inv_cms

save ${rep_root}/data/inv_id_cms, replace
restore

**Merge investigator IDs back to the main data
sort first_name last_name
merge m:1 first_name last_name using ${rep_root}/data/inv_id_cms, nogen

**Record investigator unit
destring current_unit_assigned, force gen(inv_unit)

**Record whether investigator is marked as a supervisor on this case
gen inv_supervisor = 0
replace inv_supervisor = 1 if strpos(current_position, "SUPERVISING")>0
replace inv_supervisor = . if missing(current_position) | current_position=="NULL"

**Record assignment date
gen inv_assigned_dt = date(assign_datetime, "MDY", 2025)

**Indicator for investigator is male
gen inv_male = gender=="M"
replace inv_male = . if missing(gender) | gender=="NULL"

**Simple race category for investigator (white, black, other)
gen inv_race = .
replace inv_race = 1 if race=="WHI" | race=="WWH"
replace inv_race = 2 if race=="BLK"
replace inv_race = 3 if !missing(race) & race!="NULL" & missing(inv_race)

**Record investigator birthyear
destring birth_year, force gen(inv_byr)

**Record investigator names
rename first_name inv_first
rename last_name inv_last

**Keep just cr_id and created variables
keep cr_id inv_* 
duplicates drop 

**Most cases are assigned to more than one investigator. We will attempt to isolate
**the first supervising investigator here

**Sort by assignment date, then by whether investigator is a supervisor
gsort cr_id inv_assigned_dt -inv_supervisor inv_id

**Record when a case is transferring from a supervisor to a non-sueprvisor, 
**then prioritize the first such transitision
by cr_id: gen sup_to_nonsup = inv_supervisor==1 & inv_supervisor[_n+1]==0
gsort cr_id -sup_to_nonsup inv_assigned_dt inv_id
drop sup_to_nonsup

**Mark cases assigned to multiple supervisors in a single day
by cr_id: gen inv_simul_assign = inv_assigned_dt[_n]==inv_assigned_dt[_n+1] & inv_supervisor[_n]==1 & inv_supervisor[_n+1]==1 & !missing(inv_assigned_dt)

**Restrict to just the prioritized investigator, who is the first assigned supervisors
**who then gave the case to a non-supervisor that we observe
by cr_id: keep if _n==1

**Label race category
label define race_lab 1 "White" 2 "Black" 3 "Other" 
label values inv_race race_lab

**Save investigator data
save ${rep_root}/data/inv_cms, replace
clear

**ACCUSED
**Read the raw accused data
import delimited using ${rep_root}/data/FOIA/FOIA_complaints-recent_05212021/cms_accused_with_appt_date.csv, varnames(1)

**Save original ordering of the dataset to break ties later
gen row_no = _n

**Extract matching cr_id from log numbe
rename *log_no log_no
gen log_yr = substr(log_no, 1, 4)
gen log_id = substr(log_no, 6, 7)
gen cr_id = log_yr + log_id
destring cr_id, force replace
drop if missing(cr_id)

**Record name and appointment date of accused office
gen acc_fname = trim(first_name)
gen acc_lname = trim(last_name)
gen acc_mi = trim(middle_initial)
gen acc_appoint_dt = date(appointed_date, "MDY", 2025)

**Record indicator for white, indicator for male male, birthyear, position and race of accused officer
gen acc_white = inlist(race, "WHI", "WWH", "White")
replace acc_white = . if missing(race) | race=="NULL"

gen acc_male = gender=="M" | gender=="Male"
replace acc_male = . if missing(gender) | gender=="NULL"

destring birth_year, gen(acc_byr) force

gen acc_rank = upper(position_at_complaint)

rename race acc_race

**Generate an accused ID to distinguish multiple officers named in the same complaint
sort cr_id acc_fname acc_lname acc_mi acc_appoint_dt
by cr_id acc_fname acc_lname acc_mi acc_appoint_dt: gen mark = _n==1
by cr_id: gen acc_id = sum(mark)
drop mark

**Create numerical version of number of days suspended
gen no_of_days = substr(final_penalty, 1, 3)
destring no_of_days, force replace
replace no_of_days = . if strpos(final_penalty, "SUSPENSION")==0

**Record if the last finding in the history of a given allegation is Sustained
**or (sustained + not sustained)
**mark if an officer has any allegations so sustained and the maximum of the
**number of days associated with any sustained allegation
sort cr_id acc_id allegation_category_code row_no
by cr_id acc_id allegation_category_code: gen acc_sustained_i = (recommended_finding=="Sustained" | recommended_finding=="Administratively Terminated"| (no_of_days>0 & !missing(no_of_days))) & _n==_N
by cr_id acc_id allegation_category_code: gen acc_investigated_i = (recommended_finding=="Sustained" | recommended_finding=="Administratively Terminated"| recommended_finding=="Not Sustained" | (no_of_days>0 & !missing(no_of_days))) & _n==_N
by cr_id acc_id allegation_category_code: gen acc_susp_days_i = no_of_days if acc_sustained_i==1
by cr_id acc_id: egen acc_sustained = max(acc_sustained_i)
by cr_id acc_id: egen acc_investigated = max(acc_investigated_i)
by cr_id acc_id: egen acc_susp_days = max(acc_susp_days_i)
drop acc_sustained_i acc_susp_days_i

**Record an alternative outcome (deprecated)
*gen acc_investigated = inlist(finding_cd, "NOT SUSTAINED", "EXONERATED", "ADDITIONAL INVESTIGATION REQUESTED")

**Record indicators for any officer information missing
gen miss_first = missing(acc_fname) | acc_fname=="NULL"
gen miss_last = missing(acc_lname) | acc_lname=="NULL"
gen miss_mi = missing(acc_mi) | acc_mi=="NULL"
gen miss_appoint = missing(acc_appoint_dt) 

**Keep the record from each accusation with the minimum amount of officer-identifying
**information missing
sort cr_id acc_id miss_last miss_first miss_appoint miss_mi
by cr_id acc_id: keep if _n==1

**Keep cr_id and created variables
keep cr_id acc_* recommended_finding
duplicates drop

**Save accused data (with potentially multiple observations per complaint 
**to include every accused officer)
save ${rep_root}/data/acc_cms, replace

**COMBINE
**Merge on investigator data, keeping only complaints for which we have investigator information
sort cr_id
merge m:1 cr_id using ${rep_root}/data/inv_cms, keep(2 3) gen(inv_merge)

**Merge on victim and incident information
sort cr_id
merge m:1 cr_id using ${rep_root}/data/vic_cms, keep(1 3) gen(vic_merge)

sort cr_id
merge m:1 cr_id using ${rep_root}/data/inc_cms, keep(1 3) gen(inc_merge)

**Create variables that indicate whether we have each kind of data
gen acc_data = inv_merge==3
gen vic_data = vic_merge==3
gen inc_data = inc_merge==3
drop inv_merge vic_merge inc_merge

**Create age of officer at accusation (only possible after merging incident and accused data)
gen acc_age = year(inc_complaint_dt_1) - acc_byr

save ${rep_root}/data/complaints_cms, replace
clear
log close
